[Core] Fix deadlock in garbage collection when holding lock #60014

Merged
edoakes merged 2 commits into ray-project:master from RedGrey1993:fix/dataclient_deadlock
Jan 13, 2026

@RedGrey1993
Contributor

Description

This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers ClientObjectRef.__del__() while the DataClient lock is held.

When using Ray Client, a deadlock can occur in the following scenario:

  1. The main thread acquires DataClient.lock (e.g., in _async_send())
  2. Garbage collection is triggered while the lock is held
  3. GC calls ClientObjectRef.__del__()
  4. __del__() calls call_release() → _release_server() → DataClient.ReleaseObject()
  5. ReleaseObject() tries to acquire the same DataClient.lock
  6. Deadlock: the same thread tries to acquire a non-reentrant lock it already holds
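
The scenario above can be reproduced in miniature with a plain `threading.Lock`. This is only a sketch, not Ray's code: `FakeRef` and `send_while_holding_lock` are hypothetical stand-ins for `ClientObjectRef` and a lock-holding method such as `_async_send()`. It relies on CPython reference counting to run `__del__` deterministically (the real issue is triggered by the garbage collector), and it uses a timeout so the example terminates instead of hanging.

```python
import threading

# Non-reentrant lock; stands in for DataClient.lock (assumption for this sketch).
lock = threading.Lock()


class FakeRef:
    """Hypothetical stand-in for ClientObjectRef."""

    def __init__(self, log):
        self.log = log

    def __del__(self):
        # The problematic pattern: a finalizer that needs the lock.
        # Without the timeout this call would block forever when the
        # current thread already holds the lock.
        acquired = lock.acquire(timeout=0.1)
        self.log.append(acquired)
        if acquired:
            lock.release()


def send_while_holding_lock(log):
    """Hypothetical stand-in for a method that holds the lock while working."""
    ref = FakeRef(log)
    with lock:
        # Dropping the last reference runs __del__ on this same thread
        # while the lock is still held.
        del ref
```

Calling `send_while_holding_lock(log)` with an empty list leaves `log == [False]`: the finalizer could not take the lock because its own thread was holding it.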

Related issues

Fixes #59643

Additional information

This PR implements a deferred release pattern that completely avoids the deadlock:

  1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released
  2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously
  3. Non-blocking __del__: ClientObjectRef.__del__() now only puts IDs into the queue (no lock acquisition)
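
A minimal sketch of the deferred release pattern described above, assuming a simplified client. `DeferredReleaser`, `enqueue_release`, and `release_fn` are hypothetical names for illustration; only `_release_queue` and `_release_thread` come from the PR description, and the actual implementation in Ray may differ (for example, in how it batches or shuts down).

```python
import queue
import threading


class DeferredReleaser:
    """Hypothetical sketch of a deferred release queue; not Ray's DataClient."""

    _STOP = object()  # sentinel that tells the background thread to exit

    def __init__(self, release_fn):
        self._lock = threading.Lock()  # non-reentrant, like the lock in the PR
        self._release_fn = release_fn
        self._release_queue = queue.SimpleQueue()
        self._release_thread = threading.Thread(
            target=self._release_loop, daemon=True
        )
        self._release_thread.start()

    def enqueue_release(self, object_id):
        # Safe to call from __del__: it never touches self._lock, and
        # SimpleQueue.put() is documented as safe for use in finalizers.
        self._release_queue.put(object_id)

    def _release_loop(self):
        while True:
            item = self._release_queue.get()
            if item is self._STOP:
                return
            # Only the background thread takes the lock to release,
            # so a finalizer can no longer re-enter it.
            with self._lock:
                self._release_fn(item)

    def close(self, timeout=5.0):
        # The sentinel is queued after all pending IDs (FIFO order).
        self._release_queue.put(self._STOP)
        self._release_thread.join(timeout)
        return not self._release_thread.is_alive()
```

With this shape, a `__del__` that calls `enqueue_release(self.id)` returns immediately even if the current thread holds `_lock`; the release itself happens later on `_release_thread`.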

@RedGrey1993 RedGrey1993 requested a review from a team as a code owner January 9, 2026 21:17
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a deferred release mechanism using a background thread to fix a critical deadlock during garbage collection. The approach is sound, and the new test case effectively reproduces the issue and validates the fix. I have identified a potential resource leak where object IDs in the release batch may not be flushed upon worker shutdown. Additionally, I've suggested an improvement to the close method to log a warning if the release thread doesn't terminate gracefully, which will help in debugging potential future issues.
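
The two suggestions in this review (flush leftover IDs on shutdown, warn if the release thread does not stop) could look roughly like the following. This is a sketch under assumptions: `shutdown_release_thread`, `stop_event`, `pending_ids`, and `flush_fn` are hypothetical names, not taken from the PR's diff.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def shutdown_release_thread(thread, stop_event, pending_ids, flush_fn, timeout=1.0):
    """Hypothetical helper: stop a release thread and flush what it left behind."""
    stop_event.set()
    thread.join(timeout)
    stopped = not thread.is_alive()
    if not stopped:
        # Surface the problem instead of silently leaking the thread.
        logger.warning("Release thread did not terminate within %.1fs", timeout)
    elif pending_ids:
        # Flush only after the thread has stopped, so the batch is not
        # mutated concurrently.
        flush_fn(list(pending_ids))
        pending_ids.clear()
    return stopped
```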

Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
@RedGrey1993 RedGrey1993 force-pushed the fix/dataclient_deadlock branch from f66a1cb to cb6e929 Compare January 9, 2026 21:52
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label Jan 9, 2026
Collaborator

@edoakes edoakes left a comment


Thanks for working on this @RedGrey1993!

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Jan 10, 2026
@RedGrey1993 RedGrey1993 force-pushed the fix/dataclient_deadlock branch from 755b4b0 to 5f4a5b5 Compare January 10, 2026 01:34
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
@RedGrey1993 RedGrey1993 force-pushed the fix/dataclient_deadlock branch from 5f4a5b5 to 5f0e7bb Compare January 10, 2026 01:41
@RedGrey1993 RedGrey1993 requested a review from edoakes January 10, 2026 01:54
@RedGrey1993
Contributor Author

@edoakes Thanks for the review. I've updated the code according to your suggestions. Please review again at your convenience.

@edoakes edoakes merged commit f512139 into ray-project:master Jan 13, 2026
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
@RedGrey1993 RedGrey1993 deleted the fix/dataclient_deadlock branch January 15, 2026 00:28
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[Core] Deadlock in DataClient due to recursive lock acquisition during garbage collection (__del__)

2 participants